Abstract

Detection of emerging topics is now receiving renewed interest, motivated by the rapid growth of social networks. Conventional term-frequency-based approaches may not be appropriate in this context, because the information exchanged consists not only of text but also of images, URLs, and videos. We focus on the social aspects of these networks, that is, the links between users that are generated dynamically, intentionally or unintentionally, through replies, mentions, and retweets. We propose a probability model of the mentioning behaviour of a social network user, and propose to detect the emergence of a new topic from the anomaly measured through the model. We combine the proposed mention anomaly score with a recently proposed change-point detection technique based on the Sequentially Discounting Normalized Maximum Likelihood (SDNML) coding, or with Kleinberg's burst model. Aggregating anomaly scores from hundreds of users, we show that we can detect emerging topics based only on the reply/mention relationships in social network posts. We demonstrate our technique on a number of real data sets we gathered from Twitter. The experiments show that the proposed mention-anomaly-based approaches can detect new topics at least as early as the conventional term-frequency-based approach, and sometimes much earlier when the keyword is ill-defined.

Keywords: Topic Detection, Anomaly Detection, Social Networks, Sequentially Discounting Normalized Maximum Likelihood Coding, Burst Detection



1 Introduction 

Communication through social networks, such as Facebook and Twitter, is becoming increasingly important in our daily life. Since the information exchanged over social networks consists not only of text but also of URLs, images, and videos, these networks are challenging test beds for the study of data mining.
Figure 1: Example of the emergence of a topic in social streams.



There is another type of information that is intentionally or unintentionally exchanged over social networks: mentions. By mentions we mean links to other users of the same social network in the form of message-to, reply-to, retweet-of, or explicitly in the text. One post may contain a number of mentions. Some users may include mentions in their posts rarely; other users may be mentioning their friends all the time. Some users (like celebrities) may receive mentions every minute; for others, being mentioned might be a rare occasion. In this sense, mention is like a language with the number of words equal to the number of users in a social network.

We are interested in detecting emerging topics from social network streams based on monitoring the mentioning behaviour of users. Our basic assumption is that a new (emerging) topic is something people feel like discussing, commenting on, or forwarding to their friends. Conventional approaches for topic detection have mainly been concerned with the frequencies of (textual) words [1] [5]. A term-frequency-based approach could suffer from the ambiguity caused by synonyms or homonyms. It may also require complicated preprocessing (e.g., segmentation) depending on the target language. Moreover, it cannot be applied when the contents of the messages are mostly non-textual information. On the other hand, the "words" formed by mentions are unique, require little preprocessing to obtain (the information is often separated from the contents), and are available regardless of the nature of the contents.

Figure 1 shows an example of the emergence of a topic through posts on social networks. The first post by Bob contains mentions to Alice and John, who are both probably friends of Bob's, so there is nothing unusual here. The second post by John is a reply to Bob, but it is also visible to many friends of John's who are not direct friends of Bob's. Then in the third post, Dave, one of John's friends, forwards (called retweet in Twitter) the information further down to his own friends. It is worth mentioning that the topic of this conversation is not clear from the textual information, because they are talking about something (a new gadget, car, or jewelry) that is shown as a link in the text.

In this paper, we propose a probability model that can capture the normal mentioning behaviour of a user, which consists of both the number of mentions per post and the frequency of users occurring in the mentions. This model is then used to measure the anomaly of future user behaviour. Using the proposed probability model, we can quantitatively measure the novelty or possible impact of a post as reflected in the mentioning behaviour of the user. We aggregate the anomaly scores obtained in this way over hundreds of users and apply a recently proposed change-point detection technique based on the Sequentially Discounting Normalized Maximum Likelihood (SDNML) coding [3]. This technique can detect a change in the statistical dependence structure in the time series of aggregated anomaly scores, and pin-point where the topic emergence is; see Figure 2. The effectiveness of the proposed approach is demonstrated on four data sets we have collected from Twitter. We show that our approach can detect the emergence of a new topic at least as fast as a method using the best keyword, which was not obvious at the time. Furthermore, we show that in two out of four data sets, the proposed link-anomaly-based method can detect the emergence of the topics earlier than keyword-frequency-based methods, which can be explained by the keyword ambiguity we mentioned above.

2 Related work 

Detection and tracking of topics have been studied extensively in the area of topic detection and tracking (TDT). In this context, the main task is either to classify a new document into one of the known topics (tracking) or to detect that it belongs to none of the known categories. Subsequently, the temporal structure of topics has been modeled and analyzed through dynamic model selection [?], temporal text mining [5], and factorial hidden Markov models [6].

Another line of research is concerned with formalizing the notion of "bursts" in a stream of documents. In his seminal paper, Kleinberg modeled bursts using a time-varying Poisson process with a hidden discrete process that controls the firing rate [2]. Recently, He and Parker developed a physics-inspired model of bursts based on the change in the momentum of topics [7].

All of the above-mentioned studies make use of the textual content of the documents, but not the social content. The social content (links) has been utilized in the study of citation networks [8]. However, citation networks are often analyzed in a stationary setting.

The novelty of the current paper lies in focusing on the social content of the documents (posts) and in 
combining this with a change-point analysis. 

3 Proposed Method 

The overall flow of the proposed method is shown in Figure 2. We assume that the data arrives from a social network service in a sequential manner through some API. For each new post, we use the posts of the corresponding user within the past time interval of length T as samples for training the mention model we propose below. We assign an anomaly score to each post based on the learned probability distribution. The score is then aggregated over users and fed into a change-point analysis.

3.1 Probability Model 

We characterize a post in a social network stream by the number of mentions k it contains, and the set V of names (IDs) of the users mentioned in the post. Formally, we consider the following joint probability distribution

$$P(k, V \mid \theta, \{\pi_v\}) = P(k \mid \theta) \prod_{v \in V} \pi_v. \qquad (1)$$

Here the joint distribution consists of two parts: the probability of the number of mentions k and the probability of each mention given the number of mentions. The probability of the number of mentions $P(k \mid \theta)$ is defined as a geometric distribution with parameter $\theta$ as follows:

$$P(k \mid \theta) = (1 - \theta)^k \theta. \qquad (2)$$

Figure 2: Overall flow of the proposed method. Posts arrive from the social network service through an API; for each user (e.g., Bob, John), the mention distribution is learned from that user's posts in the training window of length T; individual anomaly scores are computed, aggregated over users, and passed to the change-point analysis (SDNML).

On the other hand, the probability of mentioning users in V is defined as an independent, identical multinomial distribution with parameters $\pi_v$ ($\sum_v \pi_v = 1$).

Suppose that we are given n training examples $T = \{(k_1, V_1), \ldots, (k_n, V_n)\}$ from which we would like to learn the predictive distribution

$$P(k, V \mid T) = P(k \mid T) \prod_{v \in V} P(v \mid T). \qquad (3)$$

First we compute the predictive distribution with respect to the number of mentions, $P(k \mid T)$. This can be obtained by assuming a beta distribution as a prior and integrating out the parameter $\theta$. The density function of the beta prior distribution is written as follows:

$$p(\theta \mid \alpha, \beta) = \frac{(1 - \theta)^{\beta - 1} \theta^{\alpha - 1}}{B(\alpha, \beta)},$$



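where $\alpha$ and $\beta$ are parameters of the beta distribution and $B(\alpha, \beta)$ is the beta function. By the Bayes rule, the predictive distribution can be obtained as follows:

$$P(k \mid T, \alpha, \beta) = P(k \mid k_1, \ldots, k_n, \alpha, \beta) = \frac{P(k, k_1, \ldots, k_n \mid \alpha, \beta)}{P(k_1, \ldots, k_n \mid \alpha, \beta)} = \frac{\int_0^1 (1 - \theta)^{\sum_{i=1}^{n} k_i + k + \beta - 1} \theta^{n + 1 + \alpha - 1} \, d\theta}{\int_0^1 (1 - \theta)^{\sum_{i=1}^{n} k_i + \beta - 1} \theta^{n + \alpha - 1} \, d\theta}.$$

Both the integrals in the numerator and the denominator can be obtained in closed form as beta functions, and the predictive distribution can be rewritten as follows:

$$P(k \mid T, \alpha, \beta) = \frac{B(n + 1 + \alpha, \sum_{i=1}^{n} k_i + k + \beta)}{B(n + \alpha, \sum_{i=1}^{n} k_i + \beta)}.$$

Using the relation between the beta function and the gamma function, we can further simplify the expression as follows:

$$P(k \mid T, \alpha, \beta) = \frac{n + \alpha}{m + k + \beta} \prod_{j=0}^{k} \frac{m + \beta + j}{n + m + \alpha + \beta + j}, \qquad (4)$$

where $m = \sum_{i=1}^{n} k_i$ is the total number of mentions in the training set T.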
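As a quick numerical check of (4), the following minimal Python sketch evaluates the predictive probability both as a ratio of beta functions and in the product form. The function names and the default values alpha = beta = 1 are illustrative assumptions, not settings taken from the paper.

```python
import math


def log_beta(a, b):
    """log B(a, b) computed through log-gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def predictive_count_beta(k, counts, alpha=1.0, beta=1.0):
    """P(k | T) as a ratio of beta functions; counts = [k_1, ..., k_n].

    alpha and beta are assumed hyperparameter values (hypothetical defaults).
    """
    n, m = len(counts), sum(counts)
    return math.exp(log_beta(n + 1 + alpha, m + k + beta)
                    - log_beta(n + alpha, m + beta))


def predictive_count_product(k, counts, alpha=1.0, beta=1.0):
    """P(k | T) in the product form of (4)."""
    n, m = len(counts), sum(counts)
    prob = (n + alpha) / (m + k + beta)
    for j in range(k + 1):
        prob *= (m + beta + j) / (n + m + alpha + beta + j)
    return prob
```

The two forms should agree up to floating-point error; the log-gamma version is the safer choice when k or m is large.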

Next, we derive the predictive distribution $P(v \mid T)$ of mentioning user v. The maximum likelihood (ML) estimator is given as $P(v \mid T) = m_v / m$, where m is the total number of mentions and $m_v$ is the number of mentions to user v in the data set T. The ML estimator, however, cannot handle users that did not appear in the training set T; it would assign probability zero to all these users, which would appear infinitely anomalous in our framework. Instead we use an estimator based on the Chinese Restaurant Process (CRP; see [3]). The CRP-based estimator assigns to each user v a probability that is proportional to the number of mentions $m_v$ in the training set T; in addition, it keeps probability proportional to $\gamma$ for mentioning someone who was not mentioned in the training set T. Accordingly, the probability of known users is given as follows:

$$P(v \mid T) = \frac{m_v}{m + \gamma} \quad (\text{for } v : m_v \ge 1). \qquad (5)$$

On the other hand, the probability of mentioning a new user is given as follows:

$$P(\{v : m_v = 0\} \mid T) = \frac{\gamma}{m + \gamma}. \qquad (6)$$
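The estimator in (5) and (6) can be sketched in a few lines of Python. The function name, the flat-list input format, and the default gamma = 0.5 are assumptions made for illustration only.

```python
from collections import Counter


def crp_mention_probability(user, training_mentions, gamma=0.5):
    """CRP-based predictive probability of mentioning `user`.

    training_mentions: flat list of user IDs mentioned in the training set T.
    gamma: assumed concentration parameter (hypothetical default).
    """
    counts = Counter(training_mentions)
    m = len(training_mentions)          # total number of mentions in T
    m_v = counts.get(user, 0)           # mentions to this user in T
    if m_v >= 1:
        return m_v / (m + gamma)        # known user, equation (5)
    return gamma / (m + gamma)          # mass for an unseen user, equation (6)
```

Note that the probabilities of all known users plus the mass reserved for an unseen user sum to one, so a previously unseen mentionee receives a small but non-zero probability.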

3.2 Computing the link-anomaly score 

In order to compute the anomaly score of a new post $x = (t, u, k, V)$ by user u at time t containing k mentions to users V, we compute the probability (3) with the training set $T_u^{(t)}$, which is the collection of posts by user u in the time period $[t - T, t]$ (we use T = 30 days in this paper). Accordingly, the link-anomaly score is defined as follows:

$$s(x) = -\log\Big(P(k \mid T_u^{(t)}) \prod_{v \in V} P(v \mid T_u^{(t)})\Big) = -\log P(k \mid T_u^{(t)}) - \sum_{v \in V} \log P(v \mid T_u^{(t)}). \qquad (7)$$

The two terms in the above equation can be computed via the predictive distribution of the number of mentions (4) and the predictive distribution of the mentionee (5)-(6), respectively.
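The score (7) can be sketched as follows, assuming the training set is given as a list of posts, each post being the list of user IDs it mentions. The function name, the input format, and the default hyperparameters are illustrative assumptions; each unseen mentionee is charged the probability (6), which is one possible reading of the model.

```python
import math
from collections import Counter


def link_anomaly_score(mentioned, training_posts, alpha=1.0, beta=1.0, gamma=0.5):
    """Link-anomaly score s(x) of a post that mentions the users in `mentioned`.

    training_posts: list of posts by the same user, each a list of mentioned IDs.
    alpha, beta, gamma: assumed hyperparameters (hypothetical defaults).
    """
    k = len(mentioned)
    n = len(training_posts)
    counts = Counter(v for post in training_posts for v in post)
    m = sum(counts.values())

    # log P(k | T) using the product form of the predictive distribution (4)
    log_pk = math.log(n + alpha) - math.log(m + k + beta)
    for j in range(k + 1):
        log_pk += math.log(m + beta + j) - math.log(n + m + alpha + beta + j)

    # sum of log P(v | T) using the CRP-based estimator (5) and (6)
    log_pv = 0.0
    for v in mentioned:
        m_v = counts.get(v, 0)
        log_pv += math.log((m_v if m_v >= 1 else gamma) / (m + gamma))

    return -log_pk - log_pv
```

Under these assumptions, a post that mentions several users never seen in the training window receives a much higher score than a post mentioning a frequent contact.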

3.3 Combining Anomaly Scores from Different Users 

The anomaly score in (7) is computed for each user depending on the current post of user u and his/her past behaviour $T_u^{(t)}$. In order to measure the general trend of user behaviour, we propose to aggregate the anomaly scores obtained for posts $x_1, \ldots, x_n$ using a discretization of window size $\tau > 0$ as follows:

$$s'_j = \frac{1}{\tau} \sum_{t_i \in [\tau(j-1),\, \tau j]} s(x_i), \qquad (8)$$

where $x_i = (t_i, u_i, k_i, V_i)$ is the post at time $t_i$ by user $u_i$ including $k_i$ mentions to users $V_i$.
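A minimal sketch of the aggregation step, assuming timestamps are given in seconds from the start of the observation period and that each window is half-open, so that a post on a boundary is counted once. Both conventions, as well as the function name, are assumptions of this sketch.

```python
import math
from collections import defaultdict


def aggregate_scores(scored_posts, tau):
    """Aggregate per-post anomaly scores into windows of size tau.

    scored_posts: iterable of (t_i, s_i) pairs, where t_i is the timestamp of
    post x_i and s_i = s(x_i) is its link-anomaly score.
    Returns a dict mapping the window index j (starting at 1) to the sum of
    the scores in [tau * (j - 1), tau * j) divided by tau.
    """
    totals = defaultdict(float)
    for t, s in scored_posts:
        j = math.floor(t / tau) + 1
        totals[j] += s
    return {j: total / tau for j, total in sorted(totals.items())}
```

Windows that contain no posts are simply absent from the result; a caller feeding a change-point detector would typically fill them with zero.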

3.4 Change-point detection via Sequentially Discounting Normalized Maximum 
Likelihood Coding 

Given an aggregated measure of anomaly (8), we apply a change-point detection technique based on the SDNML coding [3]. This technique detects a change in the statistical dependence structure of a time series by monitoring the compressibility of each new piece of data. The sequential version of normalized maximum likelihood (NML) coding is employed as a coding criterion. More precisely, a change point is detected through two layers of scoring processes (see also [10, 11]); in each layer, the SDNML code length based on an autoregressive (AR) model is used as a criterion for scoring. Although the NML code length is known to be optimal [12], it is often hard to compute. The SNML proposed in [13] is an approximation to the NML code length that can be computed in a sequential manner. The SDNML proposed in [3] further employs discounting in the learning of the AR models.

Algorithmically, the change-point detection procedure can be outlined as follows. For convenience, we denote the aggregate anomaly score as $x_j$ instead of $s'_j$.

1. 1st layer learning. Let $x^{j-1} := \{x_1, \ldots, x_{j-1}\}$ be the collection of aggregate anomaly scores from discrete time 1 to $j - 1$. Sequentially learn the SDNML density function $p_{\mathrm{SDNML}}(x_j \mid x^{j-1})$ ($j = 1, 2, \ldots$); see Appendix A for details.

2. 1st layer scoring. Compute the intermediate change-point score by smoothing the log loss of the SDNML density function with window size $\kappa$ as follows:

$$y_j = \frac{1}{\kappa} \sum_{j'=j-\kappa+1}^{j} \left( -\log p_{\mathrm{SDNML}}(x_{j'} \mid x^{j'-1}) \right).$$
Algorithm 1 Dynamic Threshold Optimization (DTO) [14]

Given: $\{Score_j \mid j = 1, 2, \ldots\}$: scores, $N_H$: total number of cells, $p$: parameter for threshold, $\lambda_H$: estimation parameter, $r_H$: discounting parameter, $M$: data size

Initialization: Let $q_1^{(1)}(h)$ (a weighted sufficient statistics) be a uniform distribution.
for $j = 1, \ldots, M - 1$ do
  Threshold optimization: Let $l$ be the least index such that $\sum_{h=1}^{l} q^{(j)}(h) \ge 1 - p$. The threshold at time $j$ is given as

  $$\eta(j) = a + \frac{b - a}{N_H - 2}(l + 1).$$

  Alarm output: Raise an alarm if $Score_j > \eta(j)$.
  Histogram update:

  $$q_1^{(j+1)}(h) = \begin{cases} (1 - r_H)\, q_1^{(j)}(h) + r_H & \text{if } Score_j \text{ falls into the } h\text{th cell}, \\ (1 - r_H)\, q_1^{(j)}(h) & \text{otherwise}, \end{cases}$$

  $$q^{(j+1)}(h) = \frac{q_1^{(j+1)}(h) + \lambda_H}{\sum_{h} q_1^{(j+1)}(h) + N_H \lambda_H}.$$

end for



3. 2nd layer learning. Let $y^{j-1} := \{y_1, \ldots, y_{j-1}\}$ be the collection of smoothed change-point scores obtained as above. Sequentially learn the second layer SDNML density function $p_{\mathrm{SDNML}}(y_j \mid y^{j-1})$ ($j = 1, 2, \ldots$); see Appendix A for details.

4. 2nd layer scoring. Compute the final change-point score by smoothing the log loss of the SDNML density function as follows:

$$Score(y_j) = \frac{1}{\kappa} \sum_{j'=j-\kappa+1}^{j} \left( -\log p_{\mathrm{SDNML}}(y_{j'} \mid y^{j'-1}) \right). \qquad (9)$$
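Both scoring steps in the list above apply the same operation: a trailing average of the most recent log losses over a window of fixed size. A minimal sketch of that smoothing step is given below; the function name and the choice to return None until a full window is available are assumptions of this sketch, and the log losses themselves are taken as given.

```python
def smooth_log_loss(losses, kappa):
    """Trailing mean of the last `kappa` log losses.

    losses: list of values -log p(x_j | x^{j-1}) in time order.
    Returns a list of the same length; entries are None until a full
    window of kappa values has been observed.
    """
    smoothed = []
    window_sum = 0.0
    for j, loss in enumerate(losses):
        window_sum += loss
        if j >= kappa:
            window_sum -= losses[j - kappa]   # drop the value leaving the window
        smoothed.append(window_sum / kappa if j >= kappa - 1 else None)
    return smoothed
```

In the two-layer scheme, the output of this function on the first-layer log losses would be the input series of the second-layer learner, and the same smoothing applied to the second-layer log losses would give the final score.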

3.5 Dynamic Threshold Optimization (DTO) 

We raise an alarm if the change-point score exceeds a threshold, which is determined adaptively using the method of dynamic threshold optimization (DTO) proposed in [14].

In DTO, we use a 1-dimensional histogram for the representation of the score distribution. We learn it in a sequential and discounting way. Then, for a specified value p, we determine the threshold to be the smallest score value such that the tail probability beyond the value does not exceed p. We call p a threshold parameter.

The details of DTO are summarized as follows. Let $N_H$ be a given positive integer. Let $\{q(h)\ (h = 1, \ldots, N_H) : \sum_{h=1}^{N_H} q(h) = 1\}$ be a 1-dimensional histogram with $N_H$ bins, where h is an index of bins, with a smaller index indicating a bin having a smaller score. For given a, b such that a < b, the $N_H$ bins in the histogram are set as $(-\infty, a)$; $[a + \{(b - a)/(N_H - 2)\}\ell,\ a + \{(b - a)/(N_H - 2)\}(\ell + 1))$ $(\ell = 0, 1, \ldots, N_H - 3)$; and $[b, \infty)$. Let $\{q^{(j)}(h)\}$ be the histogram updated after seeing the jth score. The procedures of updating the histogram and DTO are given in Algorithm 1.
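The following Python sketch illustrates the discounted histogram and the threshold rule. The class and method names are hypothetical. The threshold is taken to be the upper edge of the first cell at which the cumulative mass reaches 1 - p, which is one reading of the index convention in Algorithm 1; if that cell is the open-ended last one, the threshold is infinite and no alarm is raised.

```python
import math


class DynamicThreshold:
    """Sketch of dynamic threshold optimization with a discounted histogram."""

    def __init__(self, a, b, n_bins=20, p=0.05, lam=0.01, r=0.005):
        self.a, self.b, self.n_bins = a, b, n_bins
        self.p, self.lam, self.r = p, lam, r
        self.width = (b - a) / (n_bins - 2)       # width of the interior cells
        self.q1 = [1.0 / n_bins] * n_bins         # weighted sufficient statistics
        self.q = list(self.q1)                    # smoothed histogram

    def _cell(self, score):
        """0-based cell index: 0 is (-inf, a), n_bins - 1 is [b, inf)."""
        if score < self.a:
            return 0
        if score >= self.b:
            return self.n_bins - 1
        return 1 + min(int((score - self.a) / self.width), self.n_bins - 3)

    def threshold(self):
        """Upper edge of the first cell where the cumulative mass reaches 1 - p."""
        cumulative = 0.0
        for h, mass in enumerate(self.q):
            cumulative += mass
            if cumulative >= 1.0 - self.p:
                break
        if h == self.n_bins - 1:
            return math.inf
        return self.a + self.width * h

    def update(self, score):
        """Return True if `score` raises an alarm, then update the histogram."""
        alarm = score > self.threshold()
        hit = self._cell(score)
        self.q1 = [(1 - self.r) * v + (self.r if i == hit else 0.0)
                   for i, v in enumerate(self.q1)]
        total = sum(self.q1) + self.n_bins * self.lam
        self.q = [(v + self.lam) / total for v in self.q1]
        return alarm
```

A typical use would be to call `update` once per change-point score in time order; the range [a, b] has to be chosen by the user to cover the scores expected in normal operation.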
Table 1: Number of participants in each data set.

data set          # of participants
"Job hunting"     200
"Youtube"         160
"NASA"            90
"BBC"             47



4 Experiments 

4.1 Experimental setup 

We collected four data sets from Twitter. Each data set is associated with a list of posts in a service called Togetter; Togetter is a collaborative service where people can tag Twitter posts that are related to each other and organize a list of posts that belong to a certain topic. Our goal is to evaluate whether the proposed approach can detect the emergence of the topics recognized and collected by people. We have selected four data sets, "Job hunting", "Youtube", "NASA", and "BBC", each corresponding to a user-organized list in Togetter. For each data set we collected posts from users that appeared in each list (participants). The number of participants in each data set is different; see Table 1.

We compared our proposed approach with a keyword-based change-point detection method. In the keyword-based method, we looked at a sequence of occurrence frequencies (observed within one minute) of a keyword related to the topic; the keyword was manually selected to best capture the topic. Then we applied DTO described in Section 3.5 to the sequence of keyword frequencies. In our experience, the sparsity of the keyword frequency seems to be a bad combination with the SDNML method; therefore we did not use SDNML in the keyword-based method. We use the smoothing parameter $\kappa = 15$ and the order of the AR model 30 in the experiments; the parameters in DTO were set as $p = 0.05$, $N_H = 20$, $\lambda_H = 0.01$, $r_H = 0.005$.

Furthermore, we have implemented a two-state version of Kleinberg's burst detection model [2], using the link-anomaly score (8) and the keyword frequency (as in the keyword-based change-point analysis) to select relevant posts. For the link-anomaly score, we used a threshold to select the posts to include in the burst analysis. For the keyword frequency, we used all posts that include the keyword for the burst analysis. We set the firing rate parameter of the Poisson point process to 0.001 (1/s) for the non-burst state and 0.01 (1/s) for the burst state, and the transition probability to p = 0.3. We consider the transition from the non-burst state to the burst state as an "alarm".
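One way to realize such a two-state model is a hidden Markov model over the gaps between consecutive selected posts, decoded with the Viterbi algorithm. The sketch below makes several assumptions that are not stated in the text: gaps are exponentially distributed given the state, the same probability p applies to switching in either direction, and the process starts in the non-burst state. The function name is hypothetical.

```python
import math


def burst_states(gaps, rate_base=0.001, rate_burst=0.01, p_switch=0.3):
    """Most likely state sequence (0 = non-burst, 1 = burst) for inter-post gaps.

    gaps: list of time gaps in seconds between consecutive selected posts.
    """
    if not gaps:
        return []
    rates = (rate_base, rate_burst)
    log_stay, log_switch = math.log(1.0 - p_switch), math.log(p_switch)

    def log_density(state, gap):
        # log of the exponential density rate * exp(-rate * gap)
        return math.log(rates[state]) - rates[state] * gap

    # The process is assumed to start in the non-burst state.
    score = [log_stay + log_density(0, gaps[0]),
             log_switch + log_density(1, gaps[0])]
    back = []
    for gap in gaps[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            cands = [score[prev] + (log_stay if prev == s else log_switch)
                     for prev in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            pointers.append(best)
            new_score.append(cands[best] + log_density(s, gap))
        back.append(pointers)
        score = new_score

    # Backtrack from the best final state.
    state = 0 if score[0] >= score[1] else 1
    states = [state]
    for pointers in reversed(back):
        state = pointers[state]
        states.append(state)
    states.reverse()
    return states
```

An alarm then corresponds to each index at which the decoded state changes from 0 to 1.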

A drawback of the keyword-based methods (dynamic thresholding and burst detection) is that the keyword related to the topic must be known in advance, although this is not always the case in practice. The change point detected by the keyword-based methods can be thought of as the time when the topic really emerges. Hence our goal is to detect emerging topics as early as the keyword-based methods.

4.2 "Job hunting" data set 

This data set is related to a controversial post by a famous person in Japan that "the reason students are having difficulty finding jobs is that they are stupid" and various replies to that post.

http://togetter.com/
Table 2: Detection time and the number of detections. The first detection time is defined as the time of the first alert after the event/post that initiated each topic; see captions for Figures 3 to 6 for the details.

Method                                          "Job hunting"   "Youtube"       "NASA"          "BBC"
Link-anomaly-based          # of detections     4               4               14              3
change-point detection      1st detection time  22:55, Jan 08   08:44, Nov 05   20:11, Dec 02   19:52, Jan 21
Keyword-frequency-based     # of detections     1               1               1               1
change-point detection      1st detection time  22:57, Jan 08   00:30, Nov 05   04:10, Dec 03   22:41, Jan 21
Link-anomaly-based          # of detections     1               9               25              2
burst detection             1st detection time  23:07, Jan 08   00:07, Nov 05   00:44, Nov 30   20:51, Jan 21
Keyword-frequency-based     # of detections     6               15              11              1
burst detection             1st detection time  22:50, Jan 08   23:59, Nov 04   08:34, Dec 03   22:32, Jan 21



The keyword used in the keyword-based methods was "Job hunting." Figures 3(a) and 3(b) show the results of the proposed link-anomaly-based change detection and burst detection, respectively. Figures 3(c) and 3(d) show the results of the keyword-frequency-based change detection and burst detection, respectively.

The first alarm time of the proposed link-anomaly-based change-point analysis was 22:55, whereas that for the keyword-frequency-based counterpart was 22:57; see also Table 2. The earliest detection (22:50) was achieved by the keyword-frequency-based burst detection method. Nevertheless, from Figure 3 we can observe that the proposed link-anomaly-based methods were able to detect the emerging topic almost as early as keyword-frequency-based methods.



4.3 "Youtube" data set 

This data set is related to the recent leakage of a confidential video by a Japan Coast Guard officer.

The keyword used in the keyword-based methods is "Senkaku." Figures 4(a) and 4(b) show the results of link-anomaly-based change detection and burst detection, respectively. Figures 4(c) and 4(d) show the results of keyword-frequency-based change detection and burst detection, respectively.

The first alarm time of the proposed link-anomaly-based change-point analysis was 08:44, whereas that for the keyword-based counterpart was 00:30; see also Table 2. Although the aggregated anomaly score (8) in Figure 4(a) is elevated around midnight, Nov 05, it seems that SDNML fails to detect this elevation as a change point. In fact, the link-anomaly-based burst detection (Figure 4(b)) raised an alarm at 00:07, which is earlier than the keyword-frequency-based change-point analysis and close to the keyword-frequency-based burst detection at 23:59, Nov 04.



4.4 "NASA" data set 

This data set is related to the discussion among Twitter users interested in astronomy that preceded NASA's press conference about the discovery of an arsenic-eating organism.

The keyword used in the keyword-based models is "arsenic." Figures 5(a) and 5(b) show the results of link-anomaly-based change detection and burst detection, respectively. Figures 5(c) and 5(d) show the same results for the keyword-frequency-based methods.

The first alarm times of the two link-anomaly-based methods were 20:11, Dec 02 (change-point detection) and 00:44, Nov 30 (burst detection), respectively. Both of these are earlier than NASA's official press conference (04:00, Dec 03) and are earlier than the keyword-frequency-based methods (change-point detection at 04:10, Dec 03 and burst detection at 08:34, Dec 03); see Table 2.



4.5 "BBC" data set 



This data set is related to angry reactions among Japanese Twitter users against a BBC comedy show that asked "who is the unluckiest person in the world" (the answer is a Japanese man who was hit by the nuclear bombs in both Hiroshima and Nagasaki but survived).

The keyword used in the keyword-based models is "British" (or "Britain"). Figures 6(a) and 6(b) show the results of link-anomaly-based change detection and burst detection, respectively. Figures 6(c) and 6(d) show the same results for the keyword-frequency-based methods.

The first alarm times of the two link-anomaly-based methods were 19:52 (change-point detection) and 20:51 (burst detection), both of which are earlier than the keyword-frequency-based counterparts at 22:41 (change-point detection) and 22:32 (burst detection); see Table 2.



4.6 Discussion 



Within the four data sets we have analyzed above, the proposed link-anomaly-based methods compared favorably against the keyword-frequency-based methods on the "NASA" and "BBC" data sets. On the other hand, the keyword-frequency-based methods were earlier to detect the topics on the "Job hunting" and "Youtube" data sets.

The above observation is natural, because for the "Job hunting" and "Youtube" data sets, the keywords seemed to have been unambiguously defined from the beginning of the emergence of the topics, whereas for the "NASA" and "BBC" data sets, the keywords are more ambiguous. In particular, in the case of the "NASA" data set, people had been mentioning the "arsenic" eating organism earlier than NASA's official release, but only rarely (see Figure 5(d)). Thus, the keyword-frequency-based methods could not detect the keyword as an emerging topic, although the keyword "arsenic" appeared earlier than the official release. For the "BBC" data set, the proposed link-anomaly-based burst model detects two bursty areas (Figure 6(b)). Interestingly, the link-anomaly-based change-point analysis only finds the first area (Figure 6(a)), whereas the keyword-frequency-based methods only find the second area (Figures 6(c) and 6(d)). This is probably because there was an initial stage where people reacted individually using different words, and later there was another stage in which the keywords were more unified.

In our approach, the alarm was raised if the change-point score exceeded a dynamically optimized threshold based on the significance level parameter p. Table 3 shows results for a number of threshold parameter values. We see that as p increased, the number of false alarms also increased. Meanwhile, even when p was small, our approach was still able to detect the emerging topics as early as the keyword-based methods. We set p = 0.05 as a default parameter value in our experiment. Although there are several alarms for the "NASA" data set, most of them are more or less related to the emerging topic.

Notice again that in the keyword-based methods the keyword related to the topic must be known in advance, which is not always the case in practice. Further note that our approach only uses links (mentions), hence it can be applied to the case where topics are concerned with information other than text, such as images, video, and sound.
Table 3: Number of alarms for the proposed change-point detection method based on the link-anomaly score (8) for various significance level parameter values p.

p       "Job hunting"   "Youtube"   "NASA"   "BBC"
0.01    4               2           9        3
0.05    4               4           14       3
0.1     8               6           30       3



5 Conclusion 

In this paper, we have proposed a new approach to detect the emergence of topics in a social network stream. The basic idea of our approach is to focus on the social aspect of the posts reflected in the mentioning behaviour of users instead of the textual contents. We have proposed a probability model that captures both the number of mentions per post and the frequency of mentionees. We have combined the proposed mention model with the SDNML change-point detection algorithm [3] and Kleinberg's burst detection model [2] to pin-point the emergence of a topic.

We have applied the proposed approach to four real data sets we have collected from Twitter. The four data sets included a wide-spread discussion about a controversial topic ("Job hunting" data set), a quick propagation of news about a video leaked on Youtube ("Youtube" data set), a rumor about the upcoming press conference by NASA ("NASA" data set), and an angry response to a foreign TV show ("BBC" data set). In all the data sets our proposed approach showed promising performance. In most data sets, the detection by the proposed approach was as early as that of term-frequency-based approaches that use, with the benefit of hindsight, the keyword that best describes the topic, which we manually chose afterwards. Furthermore, for the "NASA" and "BBC" data sets, in which the keyword that defines the topic is more ambiguous than in the first two data sets, the proposed link-anomaly-based approaches have detected the emergence of the topics much earlier than the keyword-based approaches.

All the analysis presented in this paper was conducted off-line, but the framework itself can be applied on-line. We are planning to scale up the proposed approach to handle social streams in real time. It would also be interesting to combine the proposed link-anomaly model with content-based topic detection approaches to further boost the performance and reduce false alarms.
A Sequentially discounting normalized maximum likelihood coding

This section describes the sequentially discounting normalized maximum likelihood (SDNML) coding that we use for change-point detection in Section 3.4. The basic idea behind SDNML-based change detection is as follows: when the data arrives in a sequential manner, we can consider that a change has occurred if a new piece of data cannot be compressed using the statistical nature of the past. The original papers [10, 11] used the predictive stochastic complexity as a measure of compressibility, whereas Urabe et al. [3] proposed to employ a tighter coding scheme based on the SDNML.

Suppose that we observe a discrete time series $x_t$ ($t = 1, 2, \ldots$); we denote the data sequence by $x^t := x_1 \cdots x_t$. Consider the parametric class of conditional probability densities $\mathcal{F} = \{p(x_t \mid x^{t-1} : \theta) : \theta \in \mathbb{R}^p\}$, where $\theta$ is the p-dimensional parameter vector and we assume $x^0$ to be an empty set. We denote the maximum likelihood (ML) estimator given the data sequence $x^t$ by $\hat{\theta}(x^t)$; i.e., $\hat{\theta}(x^t) := \arg\max_{\theta \in \mathbb{R}^p} \prod_{j=1}^{t} p(x_j \mid x^{j-1} : \theta)$. The sequential normalized maximum likelihood (SNML) model is a coding distribution (see e.g., [13]) that is known to be optimal in the sense of the conditional minimax [16] problem:

$$\min_{q(\cdot \mid x^{t-1})} \max_{x_t} \left( -\log q(x_t \mid x^{t-1}) + \max_{\theta \in \mathbb{R}^p} \log p(x^t : \theta) \right), \qquad (10)$$

where $p(x^t : \theta) := \prod_{j=1}^{t} p(x_j \mid x^{j-1} : \theta)$ is the joint density over $x^t$ induced by the conditional densities from $\mathcal{F}$. The minimization is taken over all conditional density functions and tries to minimize the regret (10) over any possible outcome of the new sample $x_t$.

The SNML distribution is obtained as the optimal conditional density of the minimax problem (10) as follows [13]:

$$p_{\mathrm{SNML}}(x_t \mid x^{t-1}) := \frac{p(x^t : \hat{\theta}(x^t))}{K_t(x^{t-1})}, \qquad (11)$$

where the normalization constant $K_t(x^{t-1}) := \int p(x^t : \hat{\theta}(x^t)) \, dx_t$ is necessary because the new sample $x_t$ is used in the estimation of the parameter vector $\hat{\theta}(x^t)$ and the numerator in (11) is not a proper density function. We call the quantity $-\log p_{\mathrm{SNML}}(x_t \mid x^{t-1})$ the SNML code-length. It is known from [16, 13] that the cumulative SNML code-length, which is the sum of the SNML code-lengths over the sequence, is optimal in the sense that it asymptotically achieves the shortest code-length.

The sequentially discounting normalized maximum likelihood (SDNML) is obtained by applying the above SNML to the class of autoregressive (AR) models and replacing the ML estimation in (11) with a discounted ML estimation, which makes the SDNML-based change-point detection algorithm more flexible than an SNML-based one. Let $x_t \in \mathbb{R}$ for each t. We define the pth order AR model as follows:

$$p(x_t \mid x_{t-p}^{t-1} : \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} \Big( x_t - \sum_{i=1}^{p} a^{(i)} x_{t-i} \Big)^2 \right),$$

where $\theta^\top = (a^\top, \sigma^2) = ((a^{(1)}, \ldots, a^{(p)}), \sigma^2)$ is the parameter vector.

In order to compute the SDNML density function we need the discounted ML estimators of the parameters in $\theta$. We define the discounted ML estimator of the regression coefficient $\hat{a}_t$ as follows:

$$\hat{a}_t = \arg\min_{a \in \mathbb{R}^p} \sum_{j=t_0+1}^{t} w_{t-j} \left( x_j - a^\top \bar{x}_j \right)^2, \qquad (12)$$

where $w_{j'} = r(1 - r)^{j'}$ is a sequence of sample weights with the discounting coefficient r ($0 < r < 1$); $t_0$ is the smallest number of samples such that the minimizer in (12) is unique; $\bar{x}_j := (x_{j-1}, x_{j-2}, \ldots, x_{j-p})^\top$. Note that the error terms from older samples receive geometrically decreasing weights in (12). The larger
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the discounting coefficient r is, the smaUer the weights of the older samples become; thus we have stronger 
discounting effect. Moreover, we obtain the discounted ML estimator of the variance ft as foliows: 



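$$\hat{\tau}_t := \operatorname*{argmax}_{\sigma^2} \prod_{j=t_0+1}^{t} p\left( x_j \mid x^{j-1} : \hat{a}_j, \sigma^2 \right) = \frac{1}{t-t_0} \sum_{j=t_0+1}^{t} \hat{e}_j^2 = \frac{s_t}{t-t_0},$$

where we define $\hat{e}_j^2 := (x_j - \hat{a}_j^\top \bar{x}_j)^2$ and $s_t := \sum_{j=t_0+1}^{t} \hat{e}_j^2$. Clearly, when the discounted estimator of the AR coefficient $\hat{a}_j$ is available, $s_t$ can be computed in a sequential manner.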
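To make the effect of the discounting coefficient concrete, the following Python sketch evaluates the discounted least-squares estimator for the special case $p = 1$, where the minimizer has a closed form as a ratio of weighted sums. The restriction to $p = 1$, the use of all available sample pairs, and the function names are assumptions made for this illustration only.

```python
def discount_weights(r, n):
    """Weights r * (1 - r)**k for ages k = 0, ..., n - 1 (k = 0 is the newest sample)."""
    return [r * (1.0 - r) ** k for k in range(n)]


def discounted_ar1(x, r):
    """Discounted least-squares AR(1) coefficient from the pairs (x[j-1], x[j])."""
    t = len(x) - 1  # index of the newest sample
    num = 0.0
    den = 0.0
    for j in range(1, t + 1):
        w = r * (1.0 - r) ** (t - j)  # older pairs get geometrically smaller weight
        num += w * x[j - 1] * x[j]
        den += w * x[j - 1] ** 2
    return num / den


# A series whose dynamics switch from x_t = x_{t-1} to x_t = -x_{t-1}.
series = [1.0] * 20 + [-1.0, 1.0, -1.0, 1.0, -1.0]
fast = discounted_ar1(series, r=0.5)   # strong discounting
slow = discounted_ar1(series, r=0.05)  # weak discounting
```

With the larger value of `r` the estimate is dominated by the most recent regime, while the smaller value still reflects the older samples, in line with the remark that a larger discounting coefficient gives a stronger discounting effect.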

In the sequel, we first describe how to efficiently compute the AR estimator $\hat{a}_j$. We then derive the SDNML density function using the discounted ML estimators $(\hat{a}_t, \hat{\tau}_t)$.

The AR coefficient $\hat{a}_j$ can simply be computed by solving the least-squares problem (12). It can, however, be obtained more efficiently using the iterative formula described in [16, 13]. Here we repeat the formula for the discounted version presented in [3]. First define the sufficient statistics $V_t \in \mathbb{R}^{p \times p}$ and $\chi_t \in \mathbb{R}^p$ as follows:

$$V_t := \sum_{j=t_0+1}^{t} w_{t-j}\, \bar{x}_j \bar{x}_j^\top, \qquad \chi_t := \sum_{j=t_0+1}^{t} w_{t-j}\, \bar{x}_j x_j.$$

Using the sufficient statistics, the discounted AR coefficient $\hat{a}_t$ from (12) can be written as follows:

$$\hat{a}_t = V_t^{-1} \chi_t.$$

Note that $\chi_t$ can be computed in a sequential manner. The inverse matrix $V_t^{-1}$ can also be computed sequentially using the Sherman-Morrison-Woodbury formula as follows:

$$V_t^{-1} = \frac{1}{1-r} V_{t-1}^{-1} - \frac{r}{1-r} \cdot \frac{V_{t-1}^{-1} \bar{x}_t \bar{x}_t^\top V_{t-1}^{-1}}{1 - r + c_t},$$

where $c_t = r\, \bar{x}_t^\top V_{t-1}^{-1} \bar{x}_t$.

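Finally the SDNML density function is written as follows:

$$p_{\mathrm{SDNML}}(x_t \mid x^{t-1}) = \frac{1}{K_t(x^{t-1})} \cdot \frac{s_t^{-(t-t_0)/2}}{s_{t-1}^{-(t-t_0-1)/2}},$$

where the normalization factor $K_t(x^{t-1})$ is calculated as follows:

$$K_t(x^{t-1}) = \frac{\sqrt{\pi}}{1-d_t} \cdot \frac{\Gamma\big((t-t_0-1)/2\big)}{\Gamma\big((t-t_0)/2\big)},$$

with $d_t = c_t/(1 - r + c_t)$.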
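Putting the pieces of this section together, the following Python sketch computes the SDNML code-length, that is, the negative logarithm of the SDNML density, along a series. It is a sketch under stated assumptions, not the exact procedure used in the experiments: the counter `m` plays the role of $t - t_0$, scoring starts only after a hypothetical `warmup` number of residuals, the inverse matrix is initialized to a large multiple of the identity, and no smoothing or thresholding of the scores is shown.

```python
import math
import random


def sdnml_code_lengths(x, p=1, r=0.05, warmup=10, eps=1e-3):
    """Return a list whose entry t is the SDNML code-length of x[t] (None during warm-up)."""
    V_inv = [[(1.0 / eps if i == j else 0.0) for j in range(p)] for i in range(p)]
    chi = [0.0] * p
    s_prev = 0.0  # running sum of squared residuals up to the previous step
    m = 0         # number of residuals accumulated so far
    scores = [None] * len(x)
    for t in range(p, len(x)):
        xbar = [x[t - i] for i in range(1, p + 1)]
        # Discounted update of the inverse matrix and the second statistic.
        u = [sum(V_inv[i][k] * xbar[k] for k in range(p)) for i in range(p)]
        c = r * sum(xbar[i] * u[i] for i in range(p))
        denom = 1.0 - r + c
        V_inv = [[(V_inv[i][j] - r * u[i] * u[j] / denom) / (1.0 - r)
                  for j in range(p)] for i in range(p)]
        chi = [(1.0 - r) * chi[i] + r * xbar[i] * x[t] for i in range(p)]
        a_hat = [sum(V_inv[i][k] * chi[k] for k in range(p)) for i in range(p)]
        # Residual under the updated coefficients, and the updated residual sum.
        e = x[t] - sum(a_hat[i] * xbar[i] for i in range(p))
        s_t = s_prev + e * e
        m += 1
        if m > warmup and s_prev > 0.0:
            d = c / denom
            # Logarithm of the normalization factor, using log-gamma for stability.
            log_K = (0.5 * math.log(math.pi) - math.log(1.0 - d)
                     + math.lgamma((m - 1) / 2.0) - math.lgamma(m / 2.0))
            # Code-length: log K + (m/2) log s_t - ((m-1)/2) log s_{t-1}.
            scores[t] = log_K + 0.5 * m * math.log(s_t) - 0.5 * (m - 1) * math.log(s_prev)
        s_prev = s_t
    return scores


def demo_series(seed=0):
    """Synthetic data: unit-variance noise whose mean jumps from 0 to 10 at index 200."""
    rng = random.Random(seed)
    before = [rng.gauss(0.0, 1.0) for _ in range(200)]
    after = [10.0 + rng.gauss(0.0, 1.0) for _ in range(50)]
    return before + after


scores = sdnml_code_lengths(demo_series())
```

On this synthetic series the code-length at the jump is much larger than the values seen just before it, which is the kind of signal a change-point detector built on the code-length responds to.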

(a) Link-anomaly-based change-point analysis. Green: Aggregated anomaly score at $\tau = 1$ minute. Blue: Change-point score (9). Red: Alarm time.
(b) Link-anomaly-based burst detection. Blue: Aggregated anomaly score at $\tau = 1$ second. Cyan: Threshold for the filtering step in Kleinberg's burst model. Red: Burst state.
(c) Keyword-frequency-based change-point analysis. Blue: Frequency of keyword "Job hunting" per one minute. Red: Alarm time.
(d) Keyword-frequency-based burst detection. Blue: Frequency of keyword "Job hunting" per one second. Red: Burst state (burst or not).

Figure 3: Result of "Job hunting" data set. The initial controversial post was posted at 22:50, Jan 08 (green lines in (b) and (d)).



(a) Link-anomaly-based change-point analysis. Green: Aggregated anomaly score at $\tau = 1$ minute. Blue: Change-point score (9). Red: Alarm time.
(b) Link-anomaly-based burst detection. Blue: Aggregated anomaly score at $\tau = 1$ second. Cyan: Threshold for the filtering step in Kleinberg's burst model. Red: Burst state.
(c) Keyword-frequency-based change-point analysis. Blue: Frequency of keyword "Senkaku" per one minute. Red: Alarm time.
(d) Keyword-frequency-based burst detection. Blue: Frequency of keyword "Senkaku" per one second. Red: Burst state (burst or not).
Horizontal axis in all panels: Time, from Nov 04 11:00 to Nov 05 11:00.

Figure 4: Result of "Youtube" data set. The first post about the video leakage was posted at 23:48, Nov 04 (green lines in (b) and (d)).

(a) Link-anomaly-based change-point analysis. Green: Aggregated anomaly score at $\tau = 1$ minute. Blue: Change-point score (9). Red: Alarm time.
(b) Link-anomaly-based burst detection. Blue: Aggregated anomaly score at $\tau = 1$ second. Cyan: Threshold for the filtering step in Kleinberg's burst model. Red: Burst state.
(c) Keyword-frequency-based change-point analysis. Blue: Frequency of keyword "arsenic" per one minute. Red: Alarm time.
(d) Keyword-frequency-based burst detection. Blue: Frequency of keyword "arsenic" per one second. Red: Burst state (burst or not).
Horizontal axis in all panels: Time.

Figure 5: Result of "NASA" data set. The initial post predicting NASA's finding about an arsenic-eating organism was posted at 22:30, Nov 30, much earlier than NASA's official press conference at 04:00, Dec 03.

(a) Link-anomaly-based change-point analysis. Green: Aggregated anomaly score at $\tau = 1$ minute. Blue: Change-point score (9). Red: Alarm time.
(b) Link-anomaly-based burst detection. Blue: Aggregated anomaly score at $\tau = 1$ second. Cyan: Threshold for the filtering step in Kleinberg's burst model. Red: Burst state.
(c) Keyword-frequency-based change-point analysis. Blue: Frequency of keyword "British" per one minute. Red: Alarm time.
(d) Keyword-frequency-based burst detection. Blue: Frequency of keyword "British" per one second. Red: Burst state (burst or not).

Figure 6: Result of "BBC" data set. The first post about BBC's comedy show was posted at 17:08, Jan 21 (green lines in (b) and (d)).